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Abstract. Mixtures of Gaussian factors are powerful tools for modeling an unob¬ 
served heterogeneous population, offering - at the same time - dimension reduc¬ 
tion and model-based clustering. Unfortunately, the high prevalence of spurious 
solutions and the disturbing effects of outlying observations, along maximum like¬ 
lihood estimation, open serious issues. In this paper we consider restrictions for 
the component covariances, to avoid spurious solutions, and trimming, to provide 
robustness against violations of normality assumptions of the underlying latent fac¬ 
tors. A detailed AECM algorithm for this new approach is presented. Simulation 
results and an application to the AIS dataset show the aim and effectiveness of the 
proposed methodology. 
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5 Concluding remarks 


1 Introduction and motivation 

Factor analysis is an effective method of summarizing the variability between a number of cor¬ 
related features, through a much smaller number of unobservable, hence named latent , factors. 
It originated from the consideration that, in many phenomena, several observed variables could 
be explained by a few unobserved ones. Under this approach, each single observed variable 
(among the p ones) is assumed to be a linear combination of d underlying common factors with 
an accompanying error term to account for that part of variability that is unique to it (not in 
common with other variables). Ideally, d should be substantially smaller than p, to achieve 
parsimony. 

Clearly, the effectiveness of this met hod is limited by its global linearity, as it happens 


for principal components analysis. Hence, 


Ghahramani and Hilton 


199% and 


19971) . 


Tipping and Bishop 


McLachlan and Peel ( 2000a!) solidly widened the applicability of these approaches 


by combining local models of Gaussian factors in the form of finite mixtures. The idea is to 
employ latent variables to perform dimensional reduction in each component, thus providing 
a statistical method which concurrently performs clustering and, within each cluster, local di¬ 
mensionality reduction. 

In the literature, error and factors are routinely assumed to have a Gaussian distribution be¬ 
cause of their mathematical and computational tractability: however statistical methods which 
ignore departure from normality may cause biased or misleading inference. Moreover, it is 
well known that maximum likelihood estimation for mixtures often leads to ill-posed problems 
because of the unboundedness of the objective function to be maximized, which favors the 
appearance of non-interesting local maximizers and degenerate or spurious solutions. 

The lack of robustness in mixture fitting arises whenever the sample contains a certain pro¬ 
portion of data that does not follow the underlying population model. Spurious solutions can 
even appear when ML estimation is applied to artificial data drawn from a given finite mixture 
model, i.e. without adding any ki nd of contamination. Hence, robustified estimation is needed. 


component in 


Fralev and Rafterv 

(1998 

), mixtures of t-distributions in 

McLachlan and Peel 


(119981) . the trimmed likelihood mixture fitting method in 


estimation of contaminated mixtures in 


Nevkov et al 


( 2007l) . the trimmed ML 


Gallegos and Ritter (120091) . and the robust improper 
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ML estimator introduced in iCoretto and Hennigi (20111). among many others. Some important 
applications in fields like computer vision, patte rn recognition, analysis of microarray gene ex¬ 


pressio n data, or tomography (see, for example, 


(2003) and 


Stewart (119991) . 


Campbell et ah 


(119971) . 


Bickel 


Maitra (12001 ). respectively) suggest that more attention should be paid to robust¬ 


ness, because noise in the data sets may be frequent in all these fields of application. 

Different types of constraints have been traditionally applied in Gaussian mixtures of factor 
analyzers, for instance, some authors propose to take a common (diagonal) error matrix (as 


for the Mixtures of Common Factor Anal yzers, denoted 


by MCFA, in 


Baek et all 


2010) or 


to impose an isotropic error matrix Bishop and T ipping. 119981 . This strategy has proven to 
be effec tive in many cases, at the expenses of stronger distributional restrictions on the data. 


In 


McNicholas and Murp hy. 


2008L when analyzing parsimonious mixtures of Gaussians factor 


analyzers models, they realized that equal determinant restrictions give more stab le res ults. For 


avoiding singularities and spurious solutions, under milder conditions, 


G resel in and Ingr ass ia 


(120131) recently proposed to maximize the likelihood by constraining the eigenvalue^ o£the 


covariance matrices, following previ ous wor 


(2012) and 


1985). Furthermore, 


McLachlan and Bean 


Lin et ah 


of 


Ingrassia (2004) and going bac 


2005b . 


t to 


Baek and McLachlan (201T), 


Hathaway 


Steane et al. 


(2014) have considered the use of mixtures of /-analyzers in an attempt 


to make the model less sensiti ve to outliers, but they, too, are not robust against very extreme 


outliers (H enn ig. 


20 04); while 


Fokoue and Titterington (2003 ) proposed a Bayesian approach. 


The purpose of the present work is to introduce an estimating procedure for mixture of 
Gaussian factors analyzers that can resist the effect of outliers and avoid spurious local maxi¬ 
mizers. The proposed constraints can be also used to take into account prior information about 
the scatter parameters. 

Trimming has been shown to be a simple, powerful, flexible and computationally feasible 
way to provide robustness in many different statistical frameworks. The basic idea behind 
trimming here is the removal of a little proportion a of observations whose values would be 
the more unlikely to occur if the fitted model was true. In this way, trimming avoids that a 
small fraction of outlying observations could exert a harmful effect on the estimation of the 
parameters of the fitted model. 

Incorporating constraints in the mixture fitting estimation method moves the mathematical 
problem in a well-posed setting and hence minimizes the risk of incurring spurious solutions. 
Moreover, a correct statement of the problem allows to study the properties of the EM algo¬ 


rithms as in 


Ingrassia and Rocci (2007) and to obtain the desired statistical properties for the 
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esti mators, such as the existen ce and consistency results, as in 


Garcfa-Escudero et al. 


and 


(120081) 


G alle gos and Rit ter (120091) . 


We have organized the rest of the paper as follows. In Section [2] we introduce notation 
and summarize main ideas about Gaussian Mixtures of Factor Analyzers (in the foremost, de¬ 
noted by MFA). Then, in Section [3] we introduce the trimmed likelihood for MFA and we pro¬ 
vide fairly extensive notes concerning the EM algorithm, with incorporated trimming and con¬ 
strained estimation. In Section 0] we discuss the performance of our procedure, on the ground 
of some numerical results obtained from simulated and real data. In particular, we compare the 
bias and MSE of robustly estimated model parameters for different cases of data contamination, 
by Monte Carlo experiments. The application to the Australian Institute of Sports dataset shows 
how classification and factor analysis can be developed using the new model. Section [5]contains 
concluding notes and provides ideas for further research. 


2 Gaussian Mixtures of Factor Analyzers 

The density of the p-dimensional random variable X of interest is modeled as a mixture of 
G multivariate normal densities in some unknown proportions tt, ,... tt (: , whenever each data 
point is taken to be a realization of the following density function, 

G 

/( x 1 e ) = 5^ AV S s) (2- 1 ) 

g =i 

where 0 p (x; /z, X) denotes the /v-variate normal density function with mean vector /j, and covari¬ 
ance matrix X. Here the vector 0 = 0gm{p , G) of unknown parameters consists of the (G — 1) 
mixing proportions n g , the Gp elements of the component means n g , and the ^Gp(p+ 1) distinct 
elements of the component-covariance matrices X r MFA postulates a finite mixture of linear 
sub-models for the distribution of the full observation vector X, given the (unobservable) fac¬ 
tors U. That is, MFA provides local dimensionality reduction by assuming that the distribution 
of the observation X, ; can be given as 

X, = n g + A g Uj 5 + e ig with probability n g (g — 1,..., G) for i — 1,..., n, (2.2) 

where A g is a p x d matrix of factor loadings , the factors Ui p ,..., XJ ng are M (0,1^) distributed 
independently of the errors e ig . The latter are independently A/"(0, v & fl ) distributed, and 'T, ; is 
a p x p diagonal matrix (g = 1 ,,G). The diagonality of is one of the key assumptions 
of factor analysis: the observed variables are independent given the factors. Note that the factor 
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variables XJ ig model correlations between the elements of X,, while the errors e ig account for 
independent noise for X*. We suppose that d < p, which means that d unobservable factors are 
jointly explaining the p observable features of the statistical units. Under these assumptions, 
the mixture of factor analyzers model is given by (12.11) . where the g- th component-covariance 
matrix has the form 

X g = A g A r g + * g (g = 1,..., G). ( 2 . 3 ) 

The parameter vector 6 = 9 M fa(p, d, G ) now consists of the elements of the component means 
H g , the A g , and the \I/ S , along with the mixing proportions 7r g (g = 1,.. ., G — 1), on putting 

kg = 1 - Ei=r K g . 

3 Trimmed Mixtures of Factor Analyzers 

In this section we present the trimmed (Gaussian) mixtures of factor analyzers model (trimmed 
MFA) and we propose a feasible algorithm for its implementation. 


3.1 Problem statement 


We will fit a mixture of Gaussian factor components to a given dataset x = {x-i, x?.x„ | in 


R p by ma ximizing a trimmed mixture lo g-likelihood ('sce lNevkov et al. 


2009 


and 


Garcfa-Escudero et al. 


2007 . 


Gallegos and Ritter 


2014 1 defined as: 

G 


£trim ^ ^ '- ( X /) log 


i— 1 


^ 1 0p( x d dgi A A ^ g)Kg 

.9=1 


(3.1) 


where z(-) is a 0-1 trimming indicator function that tell us whether observation x, is trimmed 
off: z(xj)=0, or not: z(xj)=1 and S g = A g A' g + as in (12.31) . A fixed fraction a of observa¬ 
tions can be unassigned by setting E”=i Z (x,) = n(l — o )]. Hence the parameter a denotes the 
trimming level. As usual, x 1; ..., x n are the realized values of n independent and identically 
distributed random vectors Xi,..., X n with common density given in (12.11) . with component- 
covariance matrices X 9 as in (12.31) for g = 1,..., G. The component label vectors z 1} ..., z„ are 
taken to be the realized values of the random vectors Zi,..., Z n , where, for independent feature 
data, it is appropriate to assume that they are (unconditionally) multinomially distributed, i.e. 
Zi,...,Z n Mult G ( 1; 7r i,..., Tic)- 
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Moreover, to avoid the unboundedness of C trim , we introduce a constrained maximization 
of (13.11) . In more detail, with reference to the diagonal elements {i^ g ,k}k=i,...,p of the noise 
matrices g for g = 1 ,..., G we require that 

i>g U k < c noise ip g2ih for every 1 < k h < p and 1 < g 1 ± g 2 < G (3.2) 


The constant c noise is finite and such that c noise > 1, to avoid the |£J 0 case. This 


constr a int can be seen as an adapta tion to MFA of those introduced in 


42007b, 


Garcfa-Escudero et al 


V1FA in 


Gresclin and Ingrassia 


Ingrassia and Rocci 


(2008) and it is similar to the mild restrictions implemented for 


(120131) . They all go back to the seminal paper of [Hathaway 


1985j). We will look for the ML estimators of Ty, under the given constraints, and this position 


set the maximization problem as a well-defined one, and at the same time discard singularities 
and reduce spurious solutions. 

If {Afc(A)} fc=1 p denote the set of eigenvalues of the p x p matrix A, a second set of 
constraints apply on the product of the loading matrices A g A' by requiring that 


A fc (A 9l A' ) < C( oad A h (A g 2 A' ) for every 1 < k h < d and 1 < g 1 ± g 2 < G. (3.3) 


with ci oad such that 1 < c ioad < +oo. These X k (A g A' rj ) values control the different scatters in 
the reduced subspaces. In fact, these type of constraints are not needed to avoid singularities in 
the target function but they could be useful to achieve more sensible solutions. 

In the foremost, we will denote by 0 C the constrained parameter space for 6 = { 7 r g , //,,, x \>, r; 
A g ',g = 1,...,G} under the requirements (13.21) and (13.31) . 


3.2 Algorithm 


The maximization of Ctrim in (13.11) for 9 e @ c is not an easy task, obviously. We will give 
a feasible algorithm obtained by combining the Alternating Expectation-Conditional Maxi¬ 


mization algorithm (AECM) for MFA with that (wit 


in 


Garcfa-Escudero et al. 


(2014) (see, also, 


Fritz et al.. 


1 trimming and constraints) introduced 


20131) . 


The AECM is an extension of the EM, suggested by the factor structure of the model, that 
uses different specifications of missing data at each stage. The idea is to partition the vector of 
parameters 6 = (6\, 0' 2 )' in such a way that C trirn is easy to be maximized for G\ given B > and 
viceversa, replacing the M-step by a number of computationally simpler conditional maximiza¬ 
tion (CM) steps. In more detail, in the first cycle we set 6 1 = {n g , g — 1,..., G} and the 
missing data are the unobserved group labels Z — (z[,..., z' n ), while in the second cycle we set 
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d 2 = {\ g , v E f ry : g — 1..... G} and the missing data are the group labels Z and the unobserved 
latent factors U = (Un,..., U„g). Hence, the application of the AECM algorithm consists 
of two cycles, and there is one E-step and one CM-step alternatively considering 61 and 0 2 
in each cycle. Before describing the algorithm, we remark that the unobserved group labels 
Z are considered missing data in both cycles. Therefore, during the Z-th iteration, we shall de¬ 
note by z- '+' /2 ^ and 1J the conditional expectations at the first and second cycle, respectively. 


The algorithm has to be run multiple times on the same dataset, with different starting 
values, to prevent the attainment of a local, rather than global, maximum log-likelihood. In 
each run it executes the following steps: 


1 Initialization: 

Each iteration begins by selecting initial values for 6 ^ where 9 ® = ( 7 t g 0 \ A^\ 

^g°^;g = 1,..., G). Inspired from results obtained in a series of extensive test experi¬ 
ments about initialization strategies (see lMaitrat 2009), and aiming to allow the algorithm 
to visit the entire parameter space, we randomly select p +1 units (without replacement) 
for group g from the observed data x = {xj}j = i r .. ;n . In this way we obtain a subsam¬ 
ple X 9 that we arrange in a (p + 1) x p matrix, and its sample mean will be the initial 
p 9 0) .Additionally, based on these p+1 observations, we developed a new ad hoc approach 
for providing an initialization procedure for 'ly' 11 and A g 0) , to deal with the possible ex¬ 
istence of gross outlying observations among the subsamples, which could inflate dis- 
proportionally some of their eigenvalues. The rationale under our procedure is, as usual, 
to fill in randomly the missing information in the complete model through random sub¬ 
samples and, then, to estimate the other parameters. The missing information here are 
the factors Ui,..., u n , which, under the assumptions for the model, are independently 
Af(0, Id) distributed. We consider model (12.21) in group g as a regression of X, with in¬ 
tercept /i g , regression coefficients given by A g , where the explanatory variables are the 
latent factors U ig , and with regression errors e ig . Hence we draw p + 1 random observa¬ 
tions from the d-variate standard Gaussian to fill a (p + 1) x d matrix U 9 . Then we set 
A<°> = (EfiU 9 ) -1 EflX 9 where X 9 is obtained by centering the columns of the X 9 ma¬ 
trix. To provide a restricted random generation of T\ ; , we compute the (p + 1) x p matrix 
E g = x 9 — A| y 0) U 9 , and we set the diagonal elements of V E / II) equal to the variances of the 
p columns of the s g matrix. We repeat this for g = 1..... 6' and if the obtained matrices 
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Aj ; 0) and do not satisfy the required constraints (13.21) and (13.31) . then the constrained 
maximizations described in step 2.4 must be applied. Finally, weights ..., k'q in the 
interval (0,1) and summing up to 1 are randomly chosen. 


2 Trimmed AECM steps: 

The following steps 2.1-2.4. are alternatively executed until convergence (i.e. ||z^ +1 ) 


„( 0 | 


< e) for a small constant e > 0 or until reaching a maximum number of iterations 


Maxlter. The implementation of tri mming is related to the “concentration ” steps applied 


in high-breakdown robust methods (IRousseeuw and Van Driessen . 


1999b . Trimming is 


performed along the E-steps, while constraints are enforced during the second cycle CM 
step. 


2.1 First cycle. E-step: 

Here 6\ = { 7 r 9 , p g ; g = 1.....G} and the missing data are the unobserved group 
labels z = (z [,..., z' n ). The E-step on the first cycle on the (l + l)-th iteration 
requires the calculation of 

Ql (Ol ;# (,) ) =^Q(l)[£'trim{Q 1) |x] , 

which is the expected trimmed complete-data log-likelihood given the data x and 
using the current estimate 0 l: for 6. In practice it requires calculating E^a, \Z , fl |x 
and usual computations show that this step is achieved by replacing each z ig by its 
current conditional expectation given the observed data Xj, that is we replace z ig by 
. I//2 ' 1 , where the latter is evaluated as follows. Let us define 

and 

G 

Di = D (xj; 6 {l) ) = ^D g (xf, 6 {l) ) , for i = 1,..., n. 

9 =1 

After sorting these n values, the notation < .... < is adopted. Let us 
consider the subset of indices / C {1, 2,..., n} defined as 

I = {i : -D(j) > D([ na ])}. (3.4) 
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To update the parameters, only the observations with indices in / will be taken into 
account. In other words, we are tentatively discarding the proportion a of observa¬ 
tions with the smallest values. 

Then, set 


(m/2) 

ig 


Dgfad®) 
£>(x*; 0 ( 0 ) 


for i E I 


0 for i ^ I. 


Note that, for the observations with indices in /, t^ + 1//2) are the “posterior probabil¬ 
ities” often considered in standard EM algorithms applied when fitting MFAs. But, 
unlike the standard EM algorithms, the r^ 1 l/2) (and consequently the z ig ) for the 
discarded observations are set to 0. 


2.2 First cycle. CM-step: This first CM step requires the maximization of Q\(0\: 9 ,l> ' / 
over 9, with 9 2 held fixed at We get 0^ +1) by updating n g and p g as follows 


n (1+ 1/2) 


(i+1) - 


and 


where n\ 


71 


4 +1) = 


\ 11 -r 

Z^z=1 


l 9 


[n(l - a)] 

v^n (t+1/2)^ 
2-*i=l 1 ig 

( 1 + 1 / 2 ) 

rig 


— \ 11 J- 

L^i =1 'i 


*9 


, for g — 1,..., G. According to notation in 


McLachlan and Peel 


(2000bj), we set 9 {l+l,2) = (0? +1) , 9 ( 2 l) 


2.3 Second cycle. E- step: 

Here we consider 9 2 = {(A ff , \l/ 9 ), g = 1,..., G}, where the missing data are 
the unobserved group labels Z and the latent factors U. Therefore, the trimmed 
complete-data log-likelihood in this second cycle may be written as 

n G 

ttrim^Oo) = ^2 2 (xi) log ^ [<j) p (x*; ^ +1) - \ g u ig , if? g ) cj) d (u ig ; 0, I d ) 2 l+1) ] ■ 
i= 1 9=1 

The E-step on the second cycle on the Z-th iteration requires the calculation of the 
conditional expectation of C tr im: 2 , given the observed data x and using the current 
estimate 0 (;+1 / 2 ( for 9, i.e. 

Q 2 (0 2 ) 9 {l + 1 ^ ] ^J = E^(i+i/2) [C tr jm:2(02)| x ] • 
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In addition to updating the posterior probabilities E«(i+i/ 2 ) [Z i9 |x] by performing a 
concentration step and replacing each z ig by the new values z:^ 1 = rf g +V) (and con¬ 
sequently rig +1) = T ig +1 ' > ’ f° r 9 — 1, • ■ ■, G, as previously done in step 2.1), 

this leads to evaluate the following conditional expectations: E^(i+i/ 2 ) [Zt ff Uj 9 |x] 
and E«(!+i/ 2 ) [Z ig \J ig \J' ig \x\. Recalling that the conditional distribution of \J ig given 
X; is 


Ui 9 |x« ~ Af ( 7 g (x 4 - n g ), I q - 7<A) 
for i — 1,..., n and g — 1,..., G with 




auax + * 



we obtain 


Ert((+!/2) [^igUjg|Xj 

E 0 ( 1 + l/2) {z ig u is uy Xi 


z£ fl) 7? ) IX,: 


*3 


= z. 


0 + 1 ) 


As 


(0 ("V, - 


= 2 , 


*3 

(W-iWi) 




*3 


*3 


where we set 

7® = < (Af'AfH-*®)- 1 

3,® = I, - 7®A<‘» + 7< W| (x, - M<‘ +1) ) (x, - M< i+1) )'7 ?'■ 


2.4 Second cycle. CM-step for constrained estimation of A g and ^ g : 

Here our aim is to maximize Q2 (0->: 6f ,j j over 0, with 0\ held fixed at f?[ /+1) . After 
some matrix algebra, this yields the updated ML-estimates 

A - ed+O-G)' rw(0l-i 

ly 9 ~ °g 1 g 1“3 J 

= diag {sj +1 > - Aj +1) 7? } S? +1) } 

where we denote by S|/ ’ h the sample scatter matrix in group g, for g — 1 ,..., G 

n 

S<‘ +1 » = (l/n' ,+1 >) J2 4 +1) (x. - 4«>) (x, - 4 +1 >)'. 

Z=1 

Along the iterations, due to the updates, it may happen that the A g matrices do 
not belong to the constrained parameter space © c . In case that the eigenvalues of 
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the A g A' ; matrices do not satisfy the required constraint (13.31) . we obtain the ML 
solution in © c by projecting the unconstrained optimum into 0 C . To this aim, the 
singular-value decomposition of A g A' ; = L' ; E g L g is considered, with L g being an 
orthogonal matrix and E g = diag(e ff i, e g2 , •••, e gd ) a diagonal matrix (notice that 
some of these e g k may be equal to 0 if A g A' g is not full rank). Truncated singular 
values are then defined as 


[e g k]m = mm(ci oad ■ m, (max(e gk , m))), for k = 1,..., d and g = 1,..., G, 


and m being some threshold value. The loading matrices are finally updated as 
A^ +1) such that A^ +1) [Aj +1) ]' = L^E*L g with 

E* = diag ([ 

Cgl]m op t) [ e g2]m opt ) ■■■> [Cgd]m op t) 


and m opt minimizing the real valued function 

G d , 

fload(m ) = ^2 n g +1) Y ( l0g ( \- e 9k]m ) + 

5=1 k =1 ' 


e gk 

[' e gk\r, 


(3.5) 


Fritz et al. 


(12013b 


It may be mentioned here, in passing, that Proposition 3.2 in 
shows that m opt can be obtained by evaluating 2dG + l times the real valued function 
fioad{m) in (123]). 

Given the A^ +1) , we obtain the matrices 


* a = diag{s<«>- , a ~ a 
which may not necessarily satisfy the required constraint (13.21) . In this case, we set 


bP g ,k]m = min(c noise • m, ma m)), for k — 1,... ,d; g — 1,... ,G, 


and fix the optimal threshold value m opt by minimizing the following real valued 
function 


fnoise(m) ^ ^g +l) ( l0g (1^9,k]m) + 


As before, in 


Fritz et al. 


fc=i 


g,k 




g ,k\m 


(3.6) 


(120131) it is shown that m opt can be obtained in a straightful 
way by evaluating 2 pG + 1 times / no i se (m) in (13.61) . Thus, V F^ +1 ' ) is finally updated 
as 


^g ^ — diag(['0 9) i] mopt , ..., ['0fl,p]m opt )- 

It is worth to remark that the given constrained estimation provides, at each step, 
the parameters \P g and A g maximizing the likelihood in the constrained parameter 
space @ c . 
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3 Evaluate target function: After applying the trimmed and constrained EM steps, and 
setting z(xj) = 0 if i E I and z(xj) = 1 if i ^ I, the associated value of the target 
function (13.11) is evaluated. If convergence has not been achieved before reaching the 
maximum number of iterations Maxlter, results are discarded. 


The set of parameters yielding the highest value of the target function (among the multiple runs) 
and the associated trimmed indicator function z are returned as the final output of the algorithm. 
In the framework of model-based clustering, each unit is assigned to one group, based on the 
maximum a posteriori probability. Notice, in passing, that we do not need a high number of 
initializations neither a high value for Maxlter, as we will see in Section |4j 


4 Numerical studies 

In this section we present numerical studies, based on simulated and real data, to show the 
performance of the constrained and trimmed AECM algorithm with respect to unconstrained 
and/or untrimmed approaches. 


4.1 Artificial data 


We consider here the following mixture of G components of ("/-variate normal distributions. 
To perform the estimation, we consider 10 different random initial clusterings to initialize the 
algorithm at each run, as described in the previous section, and we retain the best solution. The 


needed routines have been written in R-code (R Team, 
upon request. 


2013 1. and are available from the authors 


Mixture: G = 3, d = 6, q = 2, N = 150. 


The sample has been generated with weights 7r 
parameters: 

Mr = (0,0,0,0,0,0)' 

M 2 = (5,5,5,5,5,5)' 

M 3 = (10,10,10,10,10,10)' 


(0.3, 0.4, 0.3)' according to the following 

’S'l = diag(0.1,0.1,0.1,0.1,0.1,0.1) 

M/ 2 = diag(0.4, 0.4, 0.4,0.4,0.4, 0.4) 

*3 = diag(0.2, 0.2, 0.2,0.2,0.2, 0.2) 
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Figure l4~T1 shows a specimen of randomly generated data from the given mixture. 
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Figure 4.1: A specimen of 150 data points generated from the mixture (the first two coordinates are plotted, 
groups in black, red and green) 

Our analysis begins by running the AECM algorithm on the generated sample, and consid¬ 
ering the following six settings, namely: 

51. a ’’virtually” unconstrained approach (i.e. c no ise = cioad = 10 10 ) without trimming (a = 0), 

52. an adequate constraint on \E' g , no constraint on A g (c no i S e = 5, ci oa d = 10 10 ) and no trimming (a = 0), 

53. adequate constraints on \& g and A g (c no ise = 5, ci oa d = 3), and still no trimming (a = 0), 

54. a ’’virtually” unconstrained approach (i.e. c no i se = Ci oa d = 10 10 ) with trimming (a = 0.06), 

55. an adequate constraint on no constraint on A g (c no i se = 5, ci oa( j = 10 10 ), with trimming (a = 0.06), 

56. adequate constraints on \& g and A g (c noise = 5, Ci oac j = 3), with trimming (a = 0.06) 

It is worth noticing that when setting c noise = 10 10 we want to discard singularities, and allow 
the estimation to move in a wide parameter space that contains the global maximum, among 
several local ones. In this situation the estimation could incur into spurious solutions. We expect 


13 











that the algorithm improve its performances when giving the ’’right” constraints. The adequate 
constraints can by evaluated by obtaining the maximum ratio among the eigenvalues of and 
among the singular values of A g . We have that the singular values of Ai are (3.069,1.528), of 
A 2 are (3.777, 1.873) and of A 3 are (2.091, 1.729) hence we derive ci oa d > 2.471; while the 
diagonal elements of \P g are 0.1, 0.4, and 0.2 so that c noise > 4. We applied also trimming 
to the artificially generated data, to see the effect of an unneeded elimination of the outermost 
points in the model estimation and subsequent classification. We evaluate the performance of 
the algorithm by calculating the average misclassification error r/, over 100 repetitions of the 
estimation procedure. The misclassification error is defined as the relative frequency of points 
of the sample erroneously labeled, taking into account that noise and pointwise contamination 
(when added) should be identified, as they virtually do not belong to the three groups. We 
see that the algorithm, applied without trimming, give a perfect classification with and without 
constraints, due to fact that estimation is performed along 10 random initializations. While 
adding trimming, the misclassification error is almost equal to the trimming level (as expected). 
The results are summarized in Table 14. ll Moreover, we observed that the other parameters, such 
as the means fi g , and \P g , A g for g = 1,2,3 are close to the values from which the data have 
been generated. 

Table 4.1: Misclassification error ij (average on 100 repetitions of the estimation procedure) of the AECM 
algorithm with settings S1-S6, applied on the artificially generated data 



si 

S2 

S3 

S4 

S5 

S6 

Cnoise 

to 10 

5 

5 

to 10 

5 

5 

Qoad 

to 10 

10 10 

3 

10 10 

to 10 

3 

a 

0 

0 

0 

0.06 

0.06 

0.06 

v 

0.33% 

0.04% 

0.00% 

6.45% 

6.13% 

6.00% 


Afterwards, we have considered 3 different scenarios. 

D+N: 10 points of uniform noise have been added around the data, 

D+PC: 10 points of pointwise contamination have been added outside the range of the data, 
D+N+PC: both the uniform noise and the pointwise contamination have been added to the data. 

We applied the algorithm to the different datasets in the six previous settings S1-S6 (i.e. with/without 
constraints and trimming) and we calculated the misclassification error. Results in the first row 
of Table 14.21 show that trimming is very effective to identify and discard noise in the data, and 
constraints contribute slightly, to reach perfect classification. The misclassification error (re- 


14 










ported in the second row of Table 14.21) shows that we need trimming and constraints to achieve 
a very good behavior of the algorithm. Noise and pointwise contamination could cause very 
messy estimation, as it is seen in the first three columns of the Table, where we only rely/do 
not rely on constraints. Further, we observe that trimming is a good strategy when dealing with 
uniform noise, but it is not able to resist pointwise contamination. If we want to be protected 
against all type of data contamination we do need both the use of constrained estimation and 
trimming. 


Table 4.2: Misclassification error // (average on 100 repetitions of the estimation procedure) of the AECM 
algorithm with settings S1-S6, applied on different data sets 



SI 

S2 

S3 

S4 

S5 

S6 

Cnoise 

10 10 

5 

5 

10 10 

5 

5 

Cioad 

to 10 

10 10 

3 

10 10 

10 10 

3 

a 

0 

0 

0 

0.06 

0.06 

0.06 

D+N 

0.3348 

0.4856 

0.5357 

0.0305 

0.0033 

0.0000 

D+PC 

0.2811 

0.2659 

0.2837 

0.0465 

0.0071 

0.0031 

D+N+PC 

0.4035 

0.5299 

0.5294 

0.0918 

0.0124 

0.0064 


4.1.1 Properties of the estimators for the mixture parameters 

Now, we want to perform a second analysis on this artificial data and our main interest here 
is in assessing the effect of trimming and constraints on the properties of the model estima¬ 
tors. Namely, we will estimate their bias and mean square error when the data is affected by 
noise and/or pointwise contamination. We will consider the same four scenarios we considered 
before, i.e.: 

D: the artificially generated data, 

D+N: the data with added noise, 

D+PC: the data with added pointwise contamination, 

D+N+PC: the data with added noise and pointwise contamination. 

We apply in the four scenarios the algorithm for estimating a trimmed MFA model, exploring 
the six settings on c no i se , ci oa d and a that have been shown in Table 14.21 For sake of space, we 
report our results only on the more interesting cases. 

The benchmark of all simulations is given by the results that we obtain on artificial data 
drawn from a given MFA without outliers, and they are shown in the first column of Table 14.31 
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In each experiment, we draw 1000 times a sample of size n = 150 from the mixture described 
at the beginning of this Section, and we estimate the model parameters for the trimmed MFA 
using the algorithm presented in the previous Section [3T2l We set c noise = ci oad = 10 10 and 
a = 0 for this first case, as no outliers are added to the samples. 

Notice that the considered estimators in each component are vectors (apart from TT g which 
are scalar quantities, for g = 1,..., G). We are interested in providing synthetic measures of 
their properties, such as bias and mean square error (MSE). As usual, let T be an estimator for 
the scalar parameter t, then the bias of T is given by bias(T ) = E(T) — t, i.e. it is the signed 
absolute deviation of the expected value E(T) from t. Therefore, we would have 6 biases for 
each component of the mean ji g , 6 for diag(T fl ) and 12 for A g . On the other side, MSE is 
defined as a scalar quantity, namely E(|T — t| 2 ) = trace (Var(T)) + bias(T) 2 , also for vector 
estimators. Hence, we adopted a synthesis of each parameter biases by considering the mean of 
their absolute values on each component. Below the bias, in Table 14.31 we provide the MSE in 
parenthesis. 

Then, the second experiment consists in drawing 1000 samples as before, and adding 10 
points of random uniform noise to each of them; the bias and mean square error for the model 
estimators increase dramatically, with c noise = ci oad = 10 10 and a = 0 (results are displayed in 
second column of Table 14.31) . On the other hand, results go back to the same order of magnitude 
as the benchmark if we impose c noise = 5, ci oad = 3 and a = 0.06, as it is shown in the second 
column of Table 14.41 

The third experiment is based on 1000 samples, with 10 points of pointwise contamination 
randomly added. We observed a huge increase of the bias and mean square error for the model 
estimators, without appropriate constraints and level of trimming (see results in third column of 
Table 14.31) . but whenever we run the algorithm with c noise = 5, Ci oad = 3 and a = 0.06 we came 
back to results very close to the benchmark, shown in the third column of Table l4~4l 

The fourth experiment has been developed by considering added random noise and point- 
wise contamination to the 1000 drawn samples. The results on bias and mean square error for 
the case of estimating the trimmed MFA with c noise = ci oad = 10 10 and a = 0, in the fourth 
column of Table 1431 show the harmful effects of distorted inference. On the other side, when 
we applied reasonable constraints c noise = 5, Ci oad = 3 and a trimming level a = 0.12 to cope 
with the added outliers, we got the results in the fourth column of Table 14.41 We see that robust 
inference allows reduced bias and mean square error, even in case of both sparse outliers and 
concentrated leverage points. 
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Finally the scheme of simulations on the four data sets, in the six estimation settings, have 
been repeated considering a triple sample size (n = 450). All results are summarized in Table 
14.31 when c noise = ci oa d = 10 10 and a = 0 to be compared with results in Table l4~4l in which 
appropriate constraints and trimming have been used along the estimation, to see the improved 
properties of the estimators when the sample size increases. 


Table 4.3 : Case without trimming and constraints. 

Bias as the sum of absolute deviations, followed by MSE (in parentheses) of the parameter estimators 
i, for i = 1,2,3 when dealing with different datasets. The column labels refer to the different 
scenarios, namely D stays for data, D+N stays for data and added noise, D+PC stays for data and pointwise 
contamination, D+N+PC stays for data with added noise and pointwise contamination. 



D 

D+N 

D+PC 

D+N+PC 

3*(D+N+PC) 

n 

0.0013 

(0) 

0.124 

(0.0166) 

0.1219 

(0.0153) 

0.1893 

(0.038) 

0.1649 

(0.0312) 

f 2 

0.0065 

(0) 

0.0877 

(0.0089) 

0.1359 

(0.0189) 

0.324 

(0.1072) 

0.2097 

(0.048) 

t 3 

0.0053 

(0) 

0.2118 

(0.0461) 

0.2579 

(0.067) 

0.1347 

(0.0204) 

0.0448 

(0.006) 

Ai 

0.018 

(0.305) 

1.506 

(31.736) 

1.955 

(39.371) 

11.534 

(659.863) 

13.093 

(1140.549) 

M 2 

0.006 

(0.497) 

5.15 

(345.478) 

5 87 

(133.962) 

1.845 

(99.898) 

3.728 

(207.869) 

M 3 

0.059 

(0.712) 

11.651 

(729.452) 

2.63 

(17.135) 

11.949 

(867.905) 

8.159 

(790.799) 

vj>i 

0.013 

(0.027) 

0.435 

(11.373) 

0.043 

(0.054) 

0.564 

(115.608) 

1.274 

(272.187) 

vj/2 

0.066 

(0.172) 

4.622 

(1520.318) 

0.203 

(0.79) 

3.919 

(651.607) 

9.339 

(1819.385) 

4r 3 

0.239 

(2.039) 

15.986 

(5634.188) 

0.38 

(1.891) 

14.736 

(4557.463) 

25.429 

(10196.275) 

At 

0.534 

(82.817) 

0.534 

(82.817) 

0.534 

(96.75) 

0.498 

(300.57) 

0.522 

(361.565) 

^2 

0.608 

(172.601) 

0.608 

(172.601) 

0.551 

(86.747) 

0.642 

(304.718) 

0.653 

(633.379) 

A-3 

0.335 

(404.326) 

0.335 

(404.326) 

0.354 

(53.063) 

0.373 

(284.401) 

0.341 

(326.672) 


The distributions of the estimators for the model parameters can be represented through 
some box plots, and some of them are shown in Figure l4~2l namely with reference to 713 (upper 
panel), ft 3 [1,1] (second panel), 'T 3 [l, 1] (third panel) and A 3 [l, 1] (bottom panel). We can see, 
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Table 4.4: Case with trimming and constraints. 

Bias as the sum of absolute deviations, followed by MSE (in parentheses) of the parameter estimators 
A j, for i = 1,2,3 when dealing with different datasets. The column labels refer to the different 
scenarios, namely D stays for data, D+N stays for data and added noise, D+PC stays for data and pointwise 
contamination, D+N+PC stays for data with added noise and pointwise contamination. 



D 

D+N 

D+PC 

D+N+PC 

3*(D+N+PC) 

fl 

0.0151 

(0.0002) 

0.0002 

(0) 

0.0011 

(0) 

0 

(0) 

0.0006 

(0) 

h 

0.0195 

(0.0004) 

0.0014 

(0) 

0.0011 

(0) 

0.0034 

(0) 

0.0006 

(0) 

T3 

0.0044 

(0) 

0.0012 

(0) 

0.0001 

(0) 

0.0034 

(0) 

0.0001 

(0) 

Ml 

0.006 

(0.117) 

0.010 

(0.159) 

0.017 

(0.215) 

0.006 

(0.226) 

0.001 

(0.038) 

M 2 

0.006 

(0.190) 

0.004 

(0.219) 

0.002 

(0.165) 

0.007 

(0.703) 

0.009 

(0.158) 

M3 

0.008 

(0.177) 

0.020 

(0.302) 

0.004 

(0.154) 

0.062 

(0.787) 

0.018 

(0.231) 

4»i 

0.013 

(0.010) 

0.029 

(0.025) 

0.026 

(0.024) 

0.032 

(0.035) 

0.022 

(0.009) 


0.066 

(0.089) 

0.044 

(0.082) 

0.045 

(0.081) 

0.046 

(0.102) 

0.042 

(0.058) 

4r 3 

0.066 

(0.108) 

0.072 

(0.158) 

0.075 

(0.154) 

0.076 

(0.178) 

0.075 

(0.158) 

Ai 

0.516 

(13.168) 

0.512 

(14.828) 

0.516 

(12.992) 

0.546 

(14.711) 

0.524 

(13.674) 

A-2 

0.568 

(11.049) 

0.569 

(12.311) 

0.569 

(12.036) 

0.586 

(13.875) 

0.578 

(12.832) 

a 3 

0.330 

(9.377) 

0.353 

(11.164) 

0.354 

(12.259) 

0.341 

(13.069) 

0.342 

(13.064) 


in a direct comparison, that the estimation algorithm with adequate trimming and constraints 
is able to resist all type of outlying data. In each panel, the first boxplot on the left provides 
the benchmark of the following five ones, as it shows the distribution of the estimator when the 
data has been drawn from the mixture. The second boxplot (from left to right) in each panel 
shows the distribution of the estimator when we employ constraints and trimming along the 
estimation on data and added uniform noise. The third boxplot refers to the case in which we 
deal with data and pointwise contamination, and the good results are obtained because we are 
employing constraints and trimming. The fourth box plot has been obtained when considering 
both noise and pointwise contamination, and robust estimation. The fifth box plot shows the 
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effects of noise and pointwise contamination when the estimation procedure does not employ 
constraints and trimming. Finally the sixth box plot in all panels reports the case of robust 
estimation performed on a triple sample size, still with noise and pointwise contamination. 


4.2 Real data: the AIS data set 


As an illustration, we apply the proposed technique to the Australian Institute of Sports (AIS) 
da ta, which is a famous benc hmark dataset in the multivari ate literature, originally reporte d 


by iCook and Weisbergl (119941) and subsequently analyzed by 


Azzalini and Dalla Valle (119961). 


among many other authors. The dataset consists of p = 11 physical and hematological measure¬ 
ments on 202 athletes (100 females and 102 males) in different sports, and is available within 
the R package sn (Azzalini, 2011). The observed variables are: red cell count (RCC), white cell 
count (WCC), Hematocrit (He), Hemoglobin (Hg), plasma ferritin concentration (Fe), body 
mass index, weight/height 2 (BMI), sum of skin folds (SSF), body fat percentage (Bfat), lean 
body mass (LBM), height, cm (Ht), weight, kg (Wt), a part from Sex and kind of Sport. A 
partial scatterplot of the AIS dataset is given in Figure 14.31 and Table 14.51 provides summary 
information. 


Table 4.5: Summary information for the AIS dataset 


female athletes 


male athletes 



min 

Ql 

Me 

Q3 

max 

mean 

min 

Ql 

Me 

Q3 

max 

mean 

RCC 

3.8 

4.2 

4.4 

4.5 

5.3 

4.4 

4.1 

4.9 

5.0 

5.2 

6.7 

5.0 

WCC 

3.3 

5.8 

6.7 

8.0 

13.3 

7.0 

3.9 

6.0 

7.1 

8.4 

14.3 

7.2 

He 

35.9 

38.3 

40.6 

42.3 

47.1 

40.5 

40.3 

44.2 

45.5 

46.8 

59.7 

45.6 

Hg 

11.6 

12.7 

13.5 

14.3 

15.9 

13.6 

13.5 

14.9 

15.5 

15.9 

19.2 

15.6 

Fe 

12.0 

36.0 

50.0 

71.5 

182.0 

57.0 

8.0 

55.0 

89.5 

123.5 

234.0 

96.4 

BMI 

16.8 

20.3 

21.8 

23.4 

31.9 

22.0 

19.6 

22.3 

23.6 

25.2 

34.4 

23.9 

SSF 

33.8 

59.3 

81.8 

107.4 

200.8 

87.0 

28.0 

37.5 

47.7 

58.1 

113.5 

51.4 

Bfat 

8.1 

13.2 

17.9 

21.4 

35.5 

17.8 

5.6 

7.0 

8.6 

10.0 

19.9 

9.3 

LBM 

34.4 

51.9 

54.9 

59.4 

73.0 

54.9 

48.0 

68.0 

74.5 

80.8 

106.0 

74.7 

Ht 

148.9 

171.0 

175.0 

179.7 

195.9 

174.6 

165.3 

179.7 

185.6 

191.0 

209.4 

185.5 

Wt 

37.8 

60.1 

68.1 

74.4 

96.3 

67.3 

53.8 

73.9 

83.0 

90.3 

123.2 

82.5 


Our purpose is to provide a model for the entire dataset, and classify athletes by Sex. Let 
us begin our analysis by fitting a mixture of multivariate Gaussian distributions, using Mclust 
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Figure 4.2: Boxplots of some of the simulated distributions of the mixture parameters: 

the simulated distribution of 773 , estimator for 7 T 3 = 0.3 (upper panel); 

the simulated distribution of /X 3 [ 1 ], estimator for ^[l] = 10 (second panel from above); 

the simulated distribution of vI/ 3 [ 1 , 1 ], estimator for ^3 [ 1 , 1 ] = 0.2 (third panel from above); and 

the simulated distribution of A 3 [l, 1 ], estimator for A 3 [l, 1 ] = 0.1 (fourth panel). 

The labels on the horizontal axis refers to the six settings, namely “D” stays for data, “D+N” stays for data and 
added noise, “D+N+PC” stays for data with added noise and pointwise contamination, while “c + t” has been 
added to denote the cases in which the estimation has been performed using constraints and trimming. 
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Figure 4.3: Scatterplot of some pairs of the AIS variables (female data in red, male in blue) 

package in R. The routine mclustBIC fits a set of normal mixture models, on the base of the 
parameters you set in its call. We considered from 1 to 9 components in the mixture and different 
patterns for the covariance matrices, from the more constrained homoscedastic model, to the 
more general heteroscedastic one. After the estimation, mclustBIC provides a model selection 
procedure based on the BIC, a well-known penalized likelihood criterium. In Figure 14.41 the 
BIC values for each kind of model are shown, and for each choice of the number of mixture 
components. The three letters in the acronym of the models stand respectively for the volume , 
the shape and orientation of the ellipsoids of equal probability of the components, which could 
be Equal (hence E) or Variable (V) across the components. Notice that the shape may also 
be Isotropic (hence the letter I denotes spherical ellipsoids), in this case also the orientation is 
the same. Hence, we see that Mclust suggests an EEV model (ellipsoidal, equal volume and 
shape, different orientation of the component scatters) with G = 2 components, providing the 
highest BIC value, i.e. BIC = —10251.6. Now, if we employ the model suggested by Mclust 
to classify AIS data, we obtain 18 misclassified units, i.e., a misclassification error equal to 
18/202 = 9.4%. In Figure l4~4l we can see the classification results. Surely this is not an easy 
dataset for classification, due to the apparent class overlapping we saw in the first scatterplot in 
Figure 14.31 

To improve the classification, we may exploit the conjecture that a strong correlation ex- 
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Figure 4.4: The classification of AIS data obtained through the best model from Mclust (left panel, with female 
data in red, male in blue, misclassified units as black circlecrosses), and the graphical tool for model selection 
(right panel) 

ists among the hematological and physical measurement. Therefore we fit mixtures of factor 
analyzers, assuming the existence of some underlying unknown factors (like nutritional status, 
hematological composition, overweight status indices, and so on) which jointly explains the 
observed measurements. Through the underlying factors, we aim at finding a perspective on 
data which disentangle the overlapping components. To avoid variables having a greater impact 
in the model (which is not affine equivariant) due to different scales, before performing the es¬ 
timation, the variables have been standardized to have zero mean and unit standard deviation. 
We begin by adopting the pGmm package from R, that fits mixtures of factors analyzers with 
patterned covariances. Parsimonious Gaussian mixtures are obtained by constraining the load¬ 
ing A j and the errors to be equal or not among the components. We employed the routine 
pGmmEM, considering from 1 to 9 components, and number of underlying factors d ranging 
from 1 to 6, with 30 different random initialization, to provide the best iteration (in terms of 
BIC) for each case. The best model is a CUU mixture model with d = 4 factors and G = 3 
components, with BIC = —3127.424. CUU means Constrained loading matrices A g = A and 
Unconstrained error matrices g = uj g A g , where A g are normalized diagonal matrices and u g 
is a real value varying across components. Using this model to classify athletes, we got 109 
misclassified units and we discarded it. 
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Figure 4.5: The classification of AIS data obtained through the best UUU model from pGmm with G = 2 and 
d = 4 (female data in red, male in green, misclassified units in black) 

As a second attempt using pGmm, we estimated a UUU model by setting G — 2 com¬ 
ponents, and d = 6. The acronym UUU means that we leave the estimation of loadings A j 
and errors T ; unconstrained. Based on 30 random starts, the best UUU model had BIC = 
—3330.306, and the consequent classification of the AIS dataset produces 72 misclassifies units 
(misclassification error=35.6%), that are visualized in Figure @31 

Finally, we want to show the performances of our trimmed and constrained estimation for 
MFA on the AIS data. All the results are generated by the procedure described in Section 13.21 
are based on 30 random initializations and returning the best obtained solution of the parame¬ 
ters, in terms of highest value of the final likelihood. 

Table 4.6: Trimmed and constrained MFA estimation on the AIS data set (best results over 30 random 
initializations). Misclassification error rj (in percentage) under different settings 


Cnoise 

io 10 

45 

IO 10 

45 

IO 10 

45 

IO 10 

45 

Cioad 

io 10 

IO 10 

10 

10 

IO 10 

IO 10 

10 

10 

a 

0 

0 

0 

0 

0.05 

0.05 

0.05 

0.05 


0.0891 

0.1881 

0.4554 

0.0842 

0.1039 

0.1782 

0.4505 

0.0149 


We see that the best solution, with only 3 misclassified points, has been obtained by com¬ 
bining trimming (a = 0.05) and constrained estimation of (c noise = 45) and A g (ci oa d = 10), 
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Figure 4.6: Classification of AIS data with fitted trimmed and constrained MFA (left panel), compared to non- 
robust MFA, i.e. the fitted model in 1st row of Table l4~6l (right panel). Misclassified data are denoted by O, trimmed 
data by X. 

with d — 6. The choice of d = 6 has been motivated by performing a factor analysis on the 
observations coming from the group of male athletes, in which we may employ the screw plot 
and test the hypothesis that 6 factors are sufficient, with chi square statistic equal to 97.81 on 4 
degrees of freedom, and p-value= 2.88e — 20. In any case, results with d ^ 6 have also been 
checked. The constraints, and in first place the constraint on \F g , play an important role (com¬ 
pare results in columns 2-4-6 and 8 to the ones displayed in the odd columns), but trimming is 
needed to reach the best result. This is motivated by the fact that the data, as a whole, are not 
following a 11-dimension multivariate Gaussian, as it can be easily checked by performing a 
previous Mardia test. Two results of the fitted models and the subsequent classifications are dis¬ 
played in Figure 14.61 by selecting the 2 variables in the scatterplot that allow a better vision of 
trimmed and misclassified units. We have chosen to represent the best solution (upper panel), 
with only 3 misclassified points, denoted by “O” in the graph, and with 10 trimmed points, 
denoted by “X”. In the lower panel, to make a comparison, we report classification results 
obtained by the fitted model in first row of Table 14.61 In this second case, we were doing an 
almost unconstrained estimation of g and A g and we were not applying trimming, obtaining 
18 misclassified observations. 

The misclassified observations are in rows 70, 73, and 121 in the AIS dataset. Two misclas- 
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sified units are among female athletes, one is among male athletes. The density of the mixture 
components for the observation in position 73 are close (0.000021 and 0.00034), while for the 
other two observations they are neatly different. 

Finally, we recall that trimmed observations were discarded to provide robustness to the 
parameter estimation. After estimating the model, hence, it makes sense to classify also these 
observations. The trimmed observations are in rows 11, 56, 75, 93, 99, 133, 160, 163, 166, 
178, and if we assign them by the Bayes rule to the component g having greater value of 
A,(x; 6 ) = 0 p (x; A g A! y + n g , we classify the first five in the female group of athletes, 
and the second group of five in the male group. This means that all the trimmed observations 
have been assigned to their true group. Table 14.71 shows the details of the classification, and 
Figure l4~7l plots the final result of the robust model fitting. 

Table 4.7: Trimmed units in the AIS data set and their final classification 


unit 

2?i(x; 6) = 0 p (x;/Lti, AiAi + ^ 1 ) 71-1 

D 2 (x; 6) = 0 p (x; ju 2 , A 2 A 2 + ^ 2)772 

Sex 

11 

1.4 e-15 

9.2 e-20 

F 

56 

7.2 e-08 

4.5 e-25 

F 

75 

5.2 e-09 

1.2 e-11 

F 

93 

1.7 e-07 

1.0e-10 

F 

99 

1.2 e-09 

6.4 e-70 

F 

133 

9.8 e-85 

3.2 e-12 

M 

160 

9.9 e-74 

1.5 e-08 

M 

163 

9.9 e-87 

2.0 e-08 

M 

166 

2.2 e-16 

1.4 e-13 

M 

178 

3.1 e-23 

3.8 e-13 

M 


As a last analysis on AIS dataset, we are interested in factor interpretation. 


The rotated factor loading matrices have been obtained by employing a Gradient 


tion algorithm , available through the R package GPArotation (IBernaards and JennrichL 


Browne , 


Projec 


20051: 


2001J). We opted for an oblimin transformation, which yielded results shown in Table 


14.81 We observe that the two groups highlight the same factors, while in a slightly different 
order of importance. The first factor for the group of observations for female athletes, may be 
labelled as a hematological factor, with a very high loading on He, followed by RCC and Hg. 
The second factor, loading heavily on Ht, and in a lesser extent on Wt and LBM, may be de¬ 
noted as a general nutritional status. The third and the fourth factors are related only to Fe and 
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BMI 


Figure 4.7: Classification of AIS data after classifying also trimmed observations. The three misclassified points 
are denoted by O, and they represent only 1 . 5 % of the data. 

BMI, respectively. The fifth factor can be viewed as a overweight assessment index since SSF 
and Bfat load highly on it. The sixth factor is related only to WCC. Noticing that WCC is 
not joined into the hematological factor, we observe that the specific role of lymphocytes, cells 
of the immune system that are involved in defending the body against both infectious disease 
and foreign invaders, seems to be pointed out. Analogous comments may be done on the factor 
loadings for the group of male athletes. 

5 Concluding remarks 

In this paper we propose a robust estimation for the mixture of Gaussian factors model. To 
resist pointwise contamination and sparse outliers that could arise in data collection, we adopt 
and incorporate a trimming procedure along the iterations of the EM algorithm. The key idea is 
that a small portion of observations which are highly unlikely to occur, under the current fitted 
model assumption, are discarded from contributing to the parameter estimates. Furthermore, to 
reduce spurious solutions and avoid singularities of the likelihood, we implement a constrained 
ML estimation for the component covariances. Results from Monte Carlo experiments show 
that bias and MSE of the estimators in several cases of contaminated data are comparable to 
results obtained on data without noise. Finally, the analysis on a real dataset illustrates that 
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Table 4.8: Factor loadings in the A1S data set 


RCC 

WCC 

He 

Hg 

Fe 

BMI 

SSF 

Bfat 

LBM 

Ht 

Wt 


0.697 

0.000 

0.794 

0.682 

0.002 

0.029 

-0.040 

0.014 

0.022 

0.020 

0.029 


rotated Ai (female athletes) 


-0.006 

0.009 

-0.015 

0.021 

-0.005 

-0.008 

- 0.012 

-0.024 

-0.419 

-0.924 

-0.468 


-0.009 

0.000 

0.040 

-0.025 

-0.510 

0.023 

-0.037 

0.013 

0.020 

0.023 

0.023 


-0.055 

-0.015 

-0.004 

0.047 

0.003 

0.644 

0.033 

-0.007 

0.295 

-0.128 

0.330 


0.001 

0.012 

0.010 

- 0.002 

-0.004 

-0.316 

-0.889 

-0.826 

0.054 

-0.076 

-0.235 


-0.035 

-0.941 

0.026 

0.007 

0.000 

-0.057 

-0.017 

0.008 

-0.025 

- 0.002 

-0.031 


RCC 

-0.033 

WCC 

0.003 

He 

0.048 

Hg 

-0.002 

Fe 

0.010 

BMI 

-0.371 

SSF 

-0.616 



Bfat 

-0.610 

LBM 

0.036 

Ht 

0.036 

Wt 

-0.222 


rotated A 2 (male athletes) 


0.663 

0.077 -0.015 

-0.004 

0.024 -0.003 

0.622 

-0.008 0.005 

-0.051 0.009 

0.604 

-0.006 

0.109 

0.027 1.103 

-0.656 0.074 


0.002 0.009 -0.001 

-0.009 0.003 0.007 

0.071 -0.344 0.037 

0.005 0.170 -0.022 

0.071 -0.357 0.042 


-0.025 

1.024 

0.036 

0.001 

-0.004 

0.070 

0.015 

- 0.000 

0.053 

0.009 

0.056 


-0.053 

0.013 

-0.048 

0.079 

0.008 

-0.261 

-0.026 

0.040 

-0.885 

-1.157 

-0.884 


robust estimation leads to better classification and provides direct interpretation of the factor 
loadings. 

Further investigations are needed to tune the choice of the parameters, such as the por¬ 
tion of trimming data and the values of the constraints.Though interesting, this issue is beyond 
the scope of the present paper. Surely, the researcher may specify the partial information he 
may have about the shape of the expected clusters from the data at hand, hence providing 
a part of these param eters. Then, data-dependent diagnostic b ased on trimmed BIC notion s 


(INevkov et al. . 


20071) and/or graphical tools such as the ones in 


Garcia-Escudero et al 


( 2011 ), 
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conveniently adapted to the specific case, could assist in taking appropriate choices for the rest 
of the parameters. The encouraging results here obtained suggest a deeper discussion of these 
implementation details in a future work. 
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